Football Manager 24: Feyenoord Moneyball¶

Exploratory Analysis¶

image-5.png

In [1]:
import pandas as pd

# Load the CSV file
file_path = '/Users/JumpMan/Downloads/FM Data Analytics/Feyanoord/Moneyball Scouting/2023-24/2023-24 Eredivese.csv'
data = pd.read_csv(file_path)
In [2]:
# Check the first few rows
print("First few rows of the dataset:")
data.head()
First few rows of the dataset:
Out[2]:
Name Position Age Height Weight Club Division Nationality Home-Grown Personality ... Attacking Midfielder Creative Winger Attacking Winger Creative Forward Attacking Forward Finisher Aerial Threat Reader Assister Ball Winning Defenders
0 Luuk de Jong ST (C) 33 6'2" 185 lbs PSV Eredivisie NED (SUI) Trained in nation (15-21) Fairly Professional ... 84 66 75 78 81 97 64 25 46 12
1 Sergiño Dest D/WB (RL), M (R) 23 5'7" 136 lbs PSV Eredivisie USA (NED) Trained in nation (15-21) Balanced ... 48 67 53 49 46 35 63 83 55 71
2 Seiya Maikuma D/WB/AM (R), ST (C) 26 5'10" 152 lbs AZ Eredivisie JPN - Balanced ... 93 91 85 78 87 82 64 48 78 31
3 Bas Dost ST (C) 34 6'5" 180 lbs FC Groningen Eredivisie NED Trained in nation (15-21) Spirited ... 75 52 58 64 74 98 83 32 49 55
4 Calvin Stengs M (R), AM (RC) 25 6'0" 149 lbs Feyenoord Eredivisie NED (SUR) Trained in nation (15-21) Fairly Loyal ... 98 100 99 98 99 82 57 45 96 10

5 rows × 210 columns

In [3]:
# Overview of the dataset
print("\nDataset Information:")
data.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 952 entries, 0 to 951
Columns: 210 entries, Name to Ball Winning Defenders
dtypes: float64(2), int64(194), object(14)
memory usage: 1.5+ MB
In [4]:
# Summary statistics for numerical columns
print("\nSummary Statistics for Numerical Columns:")
data.describe()
Summary Statistics for Numerical Columns:
Out[4]:
Age Starts Minutes Played Average Rating Sub Appearances Minutes/Game Goals (percentile) Goals/90 (percentile) Minutes/Goal (percentile) xG (percentile) ... Attacking Midfielder Creative Winger Attacking Winger Creative Forward Attacking Forward Finisher Aerial Threat Reader Assister Ball Winning Defenders
count 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 ... 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000 952.000000
mean 21.017857 6.421218 576.169118 2.915504 2.026261 25.387605 8.726891 10.797269 11.134454 17.646008 ... 22.147059 22.161765 21.953782 21.967437 21.962185 21.278361 21.207983 21.887605 20.227941 21.697479
std 4.743549 11.149500 966.876113 3.363326 4.187803 33.643253 23.209986 24.582441 24.847649 29.408106 ... 31.379590 31.397574 31.517758 31.524697 31.523107 31.635164 31.861009 31.521945 32.042879 31.118391
min 14.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 20.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 23.000000 7.250000 781.250000 6.730000 2.000000 52.222500 0.000000 0.000000 0.000000 29.000000 ... 44.000000 44.000000 43.250000 44.000000 43.250000 44.000000 43.250000 44.000000 43.000000 35.000000
max 39.000000 43.000000 3870.000000 7.520000 22.000000 90.000000 100.000000 100.000000 100.000000 100.000000 ... 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000

8 rows × 196 columns

In [5]:
# Check for any missing values
print("\nMissing Values in Each Column:")
data.isnull().sum()
Missing Values in Each Column:
Out[5]:
Name                      0
Position                  0
Age                       0
Height                    0
Weight                    0
                         ..
Finisher                  0
Aerial Threat             0
Reader                    0
Assister                  0
Ball Winning Defenders    0
Length: 210, dtype: int64
In [6]:
# Distribution of values in key categorical columns, like Position and Club
print("\nUnique Values in 'Position' Column:")
data['Position'].value_counts()
Unique Values in 'Position' Column:
Out[6]:
Position
GK                              114
D (C)                            87
ST (C)                           62
DM, M (C)                        51
M/AM (C)                         46
                               ... 
D/WB (R), DM, M (RC), AM (R)      1
D (RC), WB (R), DM, M (RC)        1
M (LC), AM (C)                    1
D (RL), WB/M (R)                  1
D/WB/AM (R), ST (C)               1
Name: count, Length: 127, dtype: int64
In [7]:
print("\nUnique Values in 'Club' Column:")
data['Club'].value_counts()
Unique Values in 'Club' Column:
Out[7]:
Club
Feyenoord           89
PSV                 68
sc Heerenveen       58
PEC Zwolle          58
FC Groningen        56
Go Ahead Eagles     55
N.E.C. Nijmegen     52
NAC Breda           49
Willem II           48
FC Volendam         48
Excelsior           46
Fortuna Sittard     42
FC Utrecht          42
Vitesse             38
Sparta Rotterdam    35
AZ                  34
RKC Waalwijk        33
Almere City         32
Heracles Almelo     32
FC Twente           30
Ajax Amateurs        6
Ajax                 1
Name: count, dtype: int64
In [8]:
# Print all column names
print("Column Names:")
for col in data.columns:
    print(col)
Column Names:
Name
Position
Age
Height
Weight
Club
Division
Nationality
Home-Grown
Personality
Media Handling
Wage
Transfer Value
Asking Price
Preferred Foot
Starts
Minutes Played
Average Rating
Sub Appearances
Minutes/Game
Goals (percentile)
Goals/90 (percentile)
Minutes/Goal (percentile)
xG (percentile)
xG/90 (percentile)
xG/Shot (percentile)
xG Overperformance (percentile)
xG Overperformance/90 (percentile)
Non-pen Goals (percentile)
Non-pen Goals/90 (percentile)
Non-pen Goals/Shot (percentile)
Minutes/Non-pen Goal (percentile)
Non-pen xG (percentile)
Non-pen xG/90 (percentile)
Non-pen Goals - Non-pens xG /90 (percentile)
Non-pen xG/Shot (percentile)
Non-pen xG Overperformance (percentile)
Non-pen xG Overperformance/90 (percentile)
Goals Outside Box (percentile)
Goals Outside Box/90 (percentile)
Assists (percentile)
Assists/90 (percentile)
Minutes/Assist (percentile)
xA (percentile)
xA/90 (percentile)
xA Overperformance (percentile)
xA Overperformance/90 (percentile)
Assists/Clear Cut Chances Created (percentile)
Goal Contributions (percentile)
Goal Contributions/90 (percentile)
xGC (percentile)
xGC/90 (percentile)
xGC Overperformance (percentile)
xGC Overperformance/90 (percentile)
Non-pen Goal Contributions (percentile)
Non-pen Goal Contributions/90 (percentile)
Non-pen xGC (percentile)
Non-pen xGC/90 (percentile)
Non-pen xGC Overperformance (percentile)
Non-pen xGC Overperformance/90 (percentile)
Conversion % (percentile)
Shots (percentile)
Shots/90 (percentile)
Shots on Target (percentile)
Shots on Target/90 (percentile)
Shots on Target % (percentile)
Shots Outside Box/90 (percentile)
Passes Attempted (percentile)
Passes Attempted/90 (percentile)
Passes Completed (percentile)
Passes Completed/90 (percentile)
Pass Completion % (percentile)
Progressive Passes/90 (percentile)
Progressive Passes (percentile)
Progressive Pass Rate (percentile)
Key Passes (percentile)
Key Passes/90 (percentile)
Key Pass % (percentile)
Open Play Key Passes (percentile)
Open Play Key Passes/90 (percentile)
Open Play Key Pass % (percentile)
Crosses Attempted (percentile)
Crosses Attempted/90 (percentile)
Crosses Completed (percentile)
Crosses Completed/90 (percentile)
Crosses Completed % (percentile)
Open Play Crosses Attempted (percentile)
Open Play Crosses Attempted/90 (percentile)
Open Play Crosses Completed (percentile)
Open Play Crosses Completed/90 (percentile)
Open Play Cross Completion % (percentile)
Chances Created (percentile)
Chances Created/90 (percentile)
Clear Cut Chances Created (percentile)
Clear Cut Chances Created/90 (percentile)
Pressures Attempted (percentile)
Pressures Attempted/90 (percentile)
Pressures Completed (percentile)
Pressures Completed/90 (percentile)
Pressure Success % (percentile)
Possession Won/90 (percentile)
Possession Lost/90 (percentile)
Poss+-/90 (percentile)
Poss+- % (percentile)
Dribbles/90 (percentile)
Dribbles (percentile)
Penalties Taken (percentile)
Penalties Scored (percentile)
Pens Scored % (percentile)
Tackles Attempted (percentile)
Tackles Attempted/90 (percentile)
Tackles Completed (percentile)
Tackles Completed/90 (percentile)
Tackles Failed (percentile)
Tackle Completion % (percentile)
Tackles Failed/90 (percentile)
Key Tackles (percentile)
Key Tackles/90 (percentile)
Tackle Quality (percentile)
Interceptions (percentile)
Interceptions/90 (percentile)
Blocks (percentile)
Blocks/90 (percentile)
Shots Blocked (percentile)
Shots Blocked/90 (percentile)
Headers Attempted (percentile)
Headers Attempted/90 (percentile)
Headers Won (percentile)
Headers Won/90 (percentile)
Headers Won % (percentile)
Headers Lost (percentile)
Headers Lost/90 (percentile)
Headers Lost % (percentile)
Key Headers (percentile)
Key Headers/90 (percentile)
Aerial Challenges Attempted/90 (percentile)
Duels Win % (percentile)
Fouls Against (percentile)
Fouls Made (percentile)
Net Fouls (percentile)
Fouls Won/90 (percentile)
Fouls Committed/90 (percentile)
Clearances (percentile)
Clearances/90 (percentile)
Offsides (percentile)
Offsides/90 (percentile)
Offside/Non-pen Goals (percentile)
Offside/Non-pen xG (percentile)
Distance Covered/90 (percentile)
Distance Covered (percentile)
Total Saves (percentile)
Saves/90 (percentile)
Save % (percentile)
xSave % (percentile)
xSave % Overperformance (percentile)
Saves Held (percentile)
Saves Held/90 (percentile)
Saves Held Ratio (percentile)
Saves Held/Shots Faced Ratio (percentile)
Saves Tipped (percentile)
Saves Tipped/90 (percentile)
Saves Tipped Ratio (percentile)
Saves Tipped/Shots Faced Ratio (percentile)
Saves Parried (percentile)
Saves Parried/90 (percentile)
Saves Parried Ratio (percentile)
Saves Parried/Shots Faced Ratio (percentile)
Saves/Goal Conceaded (percentile)
Save Efficiency (percentile)
Shots on Target Against (percentile)
Shots on Target Against/90 (percentile)
xGP (percentile)
xGP/90 (percentile)
Penalties Faced (percentile)
Penalties Saved (percentile)
Pens Saved % (percentile)
Goals Conceded (percentile)
Conceded/90 (percentile)
Clean Sheets (percentile)
Clean Sheet Ratio (percentile)
Red Cards (percentile)
Yellow Cards (percentile)
Yellows/Tackle (percentile)
Reds/Tackle (percentile)
Yellows/90 (percentile)
Reds/90 (percentile)
Player of the Match (percentile)
Mistakes Leading to Goal (percentile)
Sprints/90 (percentile)
Attacking Actions/90 (percentile)
Creative Actions/90 (percentile)
Defensive Actions/90 (percentile)
Goalkeeping Actions/90 (percentile)
Excitement Factor/90 (percentile)
General Performance
Goalkeeping
Defensive Defender
Creative Defender
Attacking Defender
Creative Midfielder
Attacking Midfielder
Creative Winger
Attacking Winger
Creative Forward
Attacking Forward
Finisher
Aerial Threat
Reader
Assister
Ball Winning Defenders
In [9]:
# Check for duplicate column names
duplicate_columns = data.columns[data.columns.duplicated()].tolist()
if duplicate_columns:
    print("Duplicate columns found:", duplicate_columns)

# Drop duplicate columns, keeping the first occurrence
data = data.loc[:, ~data.columns.duplicated()]

# Columns to keep in addition to the metric
additional_columns = ['Name', 'Position', 'Age', 'Height', 'Weight', 'Club', 
                      'Division', 'Nationality', 'Personality', 'Media Handling', 
                      'Wage', 'Transfer Value', 'Asking Price', 'Preferred Foot', 'Minutes Played']

# Function to get the top players for a specific metric with a minimum Minutes Played filter
def get_top_players(metric, min_minutes=500, top_n=5):
    if metric not in data.columns:
        print(f"Metric '{metric}' not found in data.")
        return None
    if 'Minutes Played' not in data.columns:
        print("Column 'Minutes Played' not found in data.")
        return None
    
    # Filter players based on the Minutes Played threshold
    filtered_data = data[data['Minutes Played'] > min_minutes]
    
    # Select the additional columns and the specified metric, then get the top players
    top_players = filtered_data[additional_columns + [metric]].sort_values(by=metric, ascending=False).head(top_n)
    return top_players

# Example usage
metric = 'Non-pen Goals (percentile)'  # Replace with the metric of your choice
min_minutes = 500  # Set the minimum minutes played
top_n = 5  # Number of top players to return

top_players_df = get_top_players(metric, min_minutes=min_minutes, top_n=top_n)

# Display the resulting DataFrame
top_players_df
Out[9]:
Name Position Age Height Weight Club Division Nationality Personality Media Handling Wage Transfer Value Asking Price Preferred Foot Minutes Played Non-pen Goals (percentile)
10 Santiago Giménez ST (C) 23 6'0" 152 lbs Feyenoord Eredivisie MEX (ARG) Fairly Determined Level-headed £21,500 p/w £54M - £60M - Left 2811 100
3 Bas Dost ST (C) 34 6'5" 180 lbs FC Groningen Eredivisie NED Spirited Media-friendly £3,600 p/w £160K - £1.6M - Right 2882 100
0 Luuk de Jong ST (C) 33 6'2" 185 lbs PSV Eredivisie NED (SUI) Fairly Professional Level-headed £41,000 p/w Not for Sale - Right 3203 98
221 Victor Edvardsen AM (RL), ST (C) 28 6'1" 191 lbs Go Ahead Eagles Eredivisie SWE Balanced Media-friendly £2,800 p/w £550K - £6.2M - Right Only 2951 98
290 Kevin van Kippersluis M/AM (RLC), ST (C) 30 6'1" 163 lbs Ajax Amateurs Dutch Vierde Divisie A NED (GER) Balanced Media-friendly £0 p/w £0 - Left Only 3870 98

Interactive Player Analysis Tool¶

This code creates an interactive tool for analyzing football player statistics using pandas and ipywidgets in a Jupyter notebook environment. Here's a detailed breakdown of its functionality:

Data Preparation¶

  1. The code starts by loading player data from a CSV file into a pandas DataFrame.
  2. It checks for and removes any duplicate columns in the dataset.
  3. A list of additional columns (player attributes) is defined to be displayed alongside the selected metrics.

Widget Setup¶

The tool uses several ipywidgets to create an interactive interface:

  1. metric_select: A multiple-selection widget that allows users to choose which metrics to analyze. It only includes numeric columns that end with '(percentile)'.
  2. min_minutes_slider: A slider to set the minimum number of minutes played by players to be included in the analysis.
  3. top_n_slider: A slider to determine how many top players to display in the results.

Dynamic Metric Input Fields¶

The create_metric_inputs function dynamically generates input fields for each selected metric:

  1. When metrics are selected or deselected, this function updates the interface.
  2. For each selected metric, it creates two input fields: one for the minimum value and one for the maximum value.
  3. These input fields allow users to set specific ranges for each metric they're interested in.

Data Filtering and Display¶

The get_top_players_interactive function is the core of the analysis:

  1. It filters the data based on the minimum minutes played.
  2. For each selected metric, it further filters the data to only include players within the specified range.
  3. If no players match all criteria, it displays a message asking the user to adjust their filters.
  4. The function then sorts the filtered data based on the first selected metric.
  5. It selects the top N players as specified by the user.
  6. Finally, it displays a table with the selected players, showing only the chosen metrics and additional player information.

Interactive Display¶

The code uses widgets.interactive to create a responsive interface that updates the results whenever the user changes any input (selected metrics, minimum minutes, number of top players, or metric ranges).

Layout¶

The final layout combines all widgets into a vertical box (VBox) for a clean, user-friendly interface:

  1. The metric selection box is placed side-by-side with the sliders for minimum minutes and top N players.
  2. Below this, the dynamically generated metric input fields are displayed.
  3. The results table is shown at the bottom, updating in real-time as the user adjusts their inputs.

This tool provides a powerful and flexible way to analyze player performance across various metrics, allowing for quick identification of top performers based on specific criteria.

2023-24 Moneyball Analysis¶

Eredivise Analysis¶

image.png

In [10]:
import pandas as pd
from IPython.display import display
import ipywidgets as widgets

# Load your data
data = pd.read_csv('/Users/JumpMan/Downloads/FM Data Analytics/Feyanoord/Moneyball Scouting/2023-24/2023-24 Eredivese.csv')

# Check for duplicate column names
duplicate_columns = data.columns[data.columns.duplicated()].tolist()
if duplicate_columns:
    print("Duplicate columns found:", duplicate_columns)

# Drop duplicate columns, keeping the first occurrence
data = data.loc[:, ~data.columns.duplicated()]

# Columns to keep in addition to the metrics
additional_columns = ['Name', 'Position', 'Age', 'Height', 'Weight', 'Club', 
                      'Division', 'Nationality', 'Personality', 'Media Handling', 
                      'Wage', 'Transfer Value', 'Asking Price', 'Preferred Foot', 'Minutes Played']

# Define the widget elements without the "(percentile)" filter for troubleshooting
metric_select = widgets.SelectMultiple(
    options=[col for col in data.columns if data[col].dtype in ['float64', 'int64']],
    description='Metrics',
    disabled=False,
    layout=widgets.Layout(width='50%', height='300px')
)

min_minutes_slider = widgets.IntSlider(value=500, min=0, max=3000, step=100, description='Min Minutes')
top_n_slider = widgets.IntSlider(value=5, min=1, max=20, description='Top N')

# Search input for player name
player_search = widgets.Text(description='Player Name', placeholder='Type player name here')

# Dropdown filters for categorical fields
team_dropdown = widgets.Dropdown(
    options=[''] + sorted(data['Club'].dropna().unique().tolist()),
    description='Team'
)
league_dropdown = widgets.Dropdown(
    options=[''] + sorted(data['Division'].dropna().unique().tolist()),
    description='League'
)
nationality_dropdown = widgets.Dropdown(
    options=[''] + sorted(data['Nationality'].dropna().unique().tolist()),
    description='Nationality'
)

# Dictionary to store metric input fields
metric_inputs = {}

# Function to create metric input fields
def create_metric_inputs(change):
    # Clear existing widgets in metric_inputs
    for metric, inputs in metric_inputs.items():
        inputs[0].close()
        inputs[1].close()
    metric_inputs.clear()

    # Update the VBox children with the new selected metrics
    inputs_vbox.children = []
    for metric in change['new']:
        min_input = widgets.FloatText(value=0, description='Min', step=1)
        max_input = widgets.FloatText(value=100, description='Max', step=1)
        metric_inputs[metric] = (min_input, max_input)
        inputs_vbox.children += (widgets.HBox([widgets.Label(metric), min_input, max_input]),)

# Create a VBox to hold the metric input fields
inputs_vbox = widgets.VBox([])

# Function to get the top players based on selected metrics, filters, and search criteria
def get_top_players_interactive(selected_metrics, min_minutes, top_n, player_name, team, league, nationality):
    # Start with the full dataset
    filtered_data = data

    # Filter by player name if provided
    if player_name:
        filtered_data = filtered_data[filtered_data['Name'].str.contains(player_name, case=False, na=False)]
        if filtered_data.empty:
            print(f"No player found with the name '{player_name}'.")
            return None

    # Filter by team if selected
    if team:
        filtered_data = filtered_data[filtered_data['Club'] == team]
        if filtered_data.empty:
            print(f"No players found for the team '{team}'.")
            return None

    # Filter by league if selected
    if league:
        filtered_data = filtered_data[filtered_data['Division'] == league]
        if filtered_data.empty:
            print(f"No players found in the league '{league}'.")
            return None

    # Filter by nationality if selected
    if nationality:
        filtered_data = filtered_data[filtered_data['Nationality'] == nationality]
        if filtered_data.empty:
            print(f"No players found from the nationality '{nationality}'.")
            return None

    # Further filter by minimum minutes played
    filtered_data = filtered_data[filtered_data['Minutes Played'] > min_minutes]

    # Apply min/max filters for each metric
    for metric, (min_input, max_input) in metric_inputs.items():
        filtered_data = filtered_data[
            (filtered_data[metric] >= min_input.value) & 
            (filtered_data[metric] <= max_input.value)
        ]

    if filtered_data.empty:
        print("No players match all criteria. Try adjusting your filters.")
        return None

    # Sort by the first selected metric
    if selected_metrics:
        sorted_data = filtered_data.sort_values(by=selected_metrics[0], ascending=False)
    else:
        sorted_data = filtered_data

    # Select top N players
    top_players = sorted_data.head(top_n)

    # Display only the selected metrics and additional columns
    columns_to_display = additional_columns + list(selected_metrics)
    display(top_players[columns_to_display])

    return top_players

# Connect the metric selection to input field creation
metric_select.observe(create_metric_inputs, names='value')

# Set up widgets for interactive functionality
interactive_output = widgets.interactive_output(
    get_top_players_interactive,
    {'selected_metrics': metric_select, 'min_minutes': min_minutes_slider, 'top_n': top_n_slider, 
     'player_name': player_search, 'team': team_dropdown, 'league': league_dropdown, 'nationality': nationality_dropdown}
)

# Combine all widgets into the final layout
final_widget = widgets.VBox([
    widgets.HBox([metric_select, widgets.VBox([min_minutes_slider, top_n_slider])]),
    widgets.HBox([player_search, team_dropdown, league_dropdown, nationality_dropdown]),
    inputs_vbox,
    interactive_output
])

display(final_widget)
VBox(children=(HBox(children=(SelectMultiple(description='Metrics', layout=Layout(height='300px', width='50%')…

Top 7 European Leagues¶

image.png

In [11]:
### import pandas as pd
from IPython.display import display
import ipywidgets as widgets

# Load your data
data_europe = pd.read_csv('//Users/JumpMan/Downloads/FM Data Analytics/Feyanoord/Moneyball Scouting/2023-24/Top 7 leagues in europe.html.csv')

# Check for duplicate column names
duplicate_columns = data_europe.columns[data_europe.columns.duplicated()].tolist()
if duplicate_columns:
    print("Duplicate columns found:", duplicate_columns)

# Drop duplicate columns, keeping the first occurrence
data_europe = data_europe.loc[:, ~data_europe.columns.duplicated()]

# Columns to keep in addition to the metrics
additional_columns = ['Name', 'Position', 'Age', 'Height', 'Weight', 'Club', 
                      'Division', 'Nationality', 'Personality', 'Media Handling', 
                      'Wage', 'Transfer Value', 'Asking Price', 'Preferred Foot', 'Minutes Played']

# Define the widget elements without the "(percentile)" filter for troubleshooting
metric_select = widgets.SelectMultiple(
    options=[col for col in data_europe.columns if data_europe[col].dtype in ['float64', 'int64']],
    description='Metrics',
    disabled=False,
    layout=widgets.Layout(width='50%', height='300px')
)

min_minutes_slider = widgets.IntSlider(value=500, min=0, max=3000, step=100, description='Min Minutes')
top_n_slider = widgets.IntSlider(value=5, min=1, max=20, description='Top N')

# Search input for player name
player_search = widgets.Text(description='Player Name', placeholder='Type player name here')

# Dropdown filters for categorical fields
team_dropdown = widgets.Dropdown(
    options=[''] + sorted(data_europe['Club'].dropna().unique().tolist()),
    description='Team'
)
league_dropdown = widgets.Dropdown(
    options=[''] + sorted(data_europe['Division'].dropna().unique().tolist()),
    description='League'
)
nationality_dropdown = widgets.Dropdown(
    options=[''] + sorted(data_europe['Nationality'].dropna().unique().tolist()),
    description='Nationality'
)

# Dictionary to store metric input fields
metric_inputs = {}

# Function to create metric input fields
def create_metric_inputs(change):
    # Clear existing widgets in metric_inputs
    for metric, inputs in metric_inputs.items():
        inputs[0].close()
        inputs[1].close()
    metric_inputs.clear()

    # Update the VBox children with the new selected metrics
    inputs_vbox.children = []
    for metric in change['new']:
        min_input = widgets.FloatText(value=0, description='Min', step=1)
        max_input = widgets.FloatText(value=100, description='Max', step=1)
        metric_inputs[metric] = (min_input, max_input)
        inputs_vbox.children += (widgets.HBox([widgets.Label(metric), min_input, max_input]),)

# Create a VBox to hold the metric input fields
inputs_vbox = widgets.VBox([])

# Function to get the top players based on selected metrics, filters, and search criteria
def get_top_players_interactive(selected_metrics, min_minutes, top_n, player_name, team, league, nationality):
    # Start with the full dataset
    filtered_data = data_europe

    # Filter by player name if provided
    if player_name:
        filtered_data = filtered_data[filtered_data['Name'].str.contains(player_name, case=False, na=False)]
        if filtered_data.empty:
            print(f"No player found with the name '{player_name}'.")
            return None

    # Filter by team if selected
    if team:
        filtered_data = filtered_data[filtered_data['Club'] == team]
        if filtered_data.empty:
            print(f"No players found for the team '{team}'.")
            return None

    # Filter by league if selected
    if league:
        filtered_data = filtered_data[filtered_data['Division'] == league]
        if filtered_data.empty:
            print(f"No players found in the league '{league}'.")
            return None

    # Filter by nationality if selected
    if nationality:
        filtered_data = filtered_data[filtered_data['Nationality'] == nationality]
        if filtered_data.empty:
            print(f"No players found from the nationality '{nationality}'.")
            return None

    # Further filter by minimum minutes played
    filtered_data = filtered_data[filtered_data['Minutes Played'] > min_minutes]

    # Apply min/max filters for each metric
    for metric, (min_input, max_input) in metric_inputs.items():
        filtered_data = filtered_data[
            (filtered_data[metric] >= min_input.value) & 
            (filtered_data[metric] <= max_input.value)
        ]

    if filtered_data.empty:
        print("No players match all criteria. Try adjusting your filters.")
        return None

    # Sort by the first selected metric
    if selected_metrics:
        sorted_data = filtered_data.sort_values(by=selected_metrics[0], ascending=False)
    else:
        sorted_data = filtered_data

    # Select top N players
    top_players = sorted_data.head(top_n)

    # Display only the selected metrics and additional columns
    columns_to_display = additional_columns + list(selected_metrics)
    display(top_players[columns_to_display])

    return top_players

# Connect the metric selection to input field creation
metric_select.observe(create_metric_inputs, names='value')

# Set up widgets for interactive functionality
interactive_output = widgets.interactive_output(
    get_top_players_interactive,
    {'selected_metrics': metric_select, 'min_minutes': min_minutes_slider, 'top_n': top_n_slider, 
     'player_name': player_search, 'team': team_dropdown, 'league': league_dropdown, 'nationality': nationality_dropdown}
)

# Combine all widgets into the final layout
final_widget = widgets.VBox([
    widgets.HBox([metric_select, widgets.VBox([min_minutes_slider, top_n_slider])]),
    widgets.HBox([player_search, team_dropdown, league_dropdown, nationality_dropdown]),
    inputs_vbox,
    interactive_output
])

display(final_widget)
VBox(children=(HBox(children=(SelectMultiple(description='Metrics', layout=Layout(height='300px', width='50%')…

Player Comparison Analysis¶

image-3.png

Attacking Metric Analysis¶

In [38]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd

def dynamic_clustering(data, features, n_clusters):
    # Select the specified features from the data
    data_selected = data[features]
    
    # Normalize the data using StandardScaler
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data_selected)
    
    # Apply KMeans clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(data_scaled)
    
    # Add cluster assignments to the data
    data['Cluster'] = clusters
    
    # Apply PCA for dimensionality reduction
    pca = PCA(n_components=2)
    pca_components = pca.fit_transform(data_scaled)
    
    # Add PCA components to the data
    data['PCA1'] = pca_components[:, 0]
    data['PCA2'] = pca_components[:, 1]
    
    # Calculate cluster centroids in PCA space
    # To calculate centroids in the PCA space, we need to use the 2 PCA components
    pca_centroids = pca.transform(kmeans.cluster_centers_)
    
    # Convert centroids to a DataFrame and use PCA1 and PCA2 for the columns
    centroids = pd.DataFrame(pca_centroids, columns=['PCA1', 'PCA2'])
    
    return data, centroids
In [32]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import plotly.express as px
import pandas as pd

# Define attacking metrics
attacking_metrics = [
    'Goals (percentile)', 'Goals/90 (percentile)', 'Minutes/Goal (percentile)', 'xG (percentile)',
    'xG/90 (percentile)', 'xG/Shot (percentile)', 'Non-pen Goals (percentile)', 'Non-pen Goals/90 (percentile)',
    'Non-pen Goals/Shot (percentile)', 'Non-pen xG (percentile)', 'Non-pen xG/90 (percentile)', 'Shots (percentile)',
    'Shots/90 (percentile)', 'Shots on Target (percentile)', 'Shots on Target/90 (percentile)', 'Shots on Target % (percentile)',
    'Shots Outside Box/90 (percentile)', 'Goal Contributions (percentile)', 'Goal Contributions/90 (percentile)',
    'Goals Outside Box (percentile)', 'Goals Outside Box/90 (percentile)', 'Conversion % (percentile)', 
    'Penalties Taken (percentile)', 'Penalties Scored (percentile)', 'Pens Scored % (percentile)'
]

# Call the dynamic clustering function for attacking metrics
attacking_clustered, attacking_centroids = dynamic_clustering(
    data=data_europe,  # Updated variable name
    features=attacking_metrics,
    n_clusters=4
)

# Calculate a combined score from the attacking metrics (average in this case)
attacking_clustered['Combined Score'] = attacking_clustered[attacking_metrics].mean(axis=1)

# Sort players by combined score to get the top performers
top_5_attacking = attacking_clustered.nlargest(5, 'Combined Score')

# Display the results
print("Top 5 Attacking Players based on combined score:")
display(top_5_attacking)

# Interactive Plotly Visualization for Attacking Metrics
fig1 = px.scatter(
    attacking_clustered,
    x='PCA1',
    y='PCA2',
    color='Cluster',
    symbol='Division',  # Use the correct column name for league
    size='Minutes Played',
    hover_data=['Name', 'Age', 'Cluster', 'Position', 'Minutes Played', 'Division'],
    title='Interactive Attacking Player Clustering'
)
fig1.update_layout(
    title_x=0.5,  # Center the title
    width=900,
    height=600,
    legend=dict(
        x=1.05,  # Adjust horizontal position (right of the plot)
        y=1,     # Adjust vertical position (top of the plot)
        xanchor="left",  # Anchor legend box to the left
        yanchor="top",   # Anchor legend box to the top
        title=dict(text="Cluster")
    ),
    paper_bgcolor='rgb(243, 243, 243)',  # Light background for clean look
    plot_bgcolor='rgba(0,0,0,0)'  # Transparent plot background
)

# Show the Plot
#fig1.show()  # Uncomment to show the plot

# Display Cluster Centroids for Attacking Metrics
print("Attacking Cluster Centroids:")
display(attacking_centroids)
/Users/JumpMan/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning:

The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning

Top 5 Attacking Players based on combined score:
Name Position Age Height Weight Inf Club Division Nationality Home-Grown ... Attacking Forward Finisher Aerial Threat Reader Assister Ball Winning Defenders Cluster PCA1 PCA2 Combined Score
4 Erling Haaland ST (C) 23 6'5" 205 lbs - Man City English Premier Division NOR (ENG) - ... 97 99 76 37 91 52 3 14.533494 12.559919 91.64
6 Harry Kane AM (C), ST (C) 30 6'2" 189 lbs - FC Bayern Bundesliga ENG (IRL) - ... 99 99 77 36 86 53 3 14.272450 10.437637 89.92
0 Kylian Mbappé AM (RL), ST (C) 25 5'10" 160 lbs - R. Madrid Spanish First Division FRA - ... 100 99 46 29 97 37 3 14.140771 10.753558 89.00
153 Victor Boniface ST (C) 23 6'3" 200 lbs - Bayer 04 Bundesliga NGA - ... 83 97 74 29 50 28 3 13.979868 11.647960 88.92
55 Romelu Lukaku ST (C) 31 6'3" 205 lbs - Parthenope Italian Serie A BEL (COD) - ... 95 98 70 32 96 13 3 13.753185 10.052220 88.08

5 rows × 215 columns

Attacking Cluster Centroids:
PCA1 PCA2
0 -2.588840 0.308740
1 7.441651 -0.862183
2 1.224423 -0.907023
3 11.167452 8.361225

Attacking Score Search

In [34]:
def search_player_stats(player_name, attacking_clustered, attacking_metrics):
    """
    Search for a player by name and return their attacking stats and combined attacking score.

    Args:
    player_name (str): Name of the player to search for.
    attacking_clustered (DataFrame): The DataFrame with player clustering data.
    attacking_metrics (list): The list of attacking metrics.

    Returns:
    dict or str: Player's stats and combined score, or message if player is not found.
    """
    # Filter the DataFrame for the player by name (case-insensitive)
    player_data = attacking_clustered[attacking_clustered['Name'].str.contains(player_name, case=False, na=False)]
    
    # If player is found, return their stats and combined score
    if not player_data.empty:
        player_stats = player_data[attacking_metrics + ['Combined Score']].iloc[0].to_dict()
        return player_stats
    return f"Player '{player_name}' not found."

# User input to search for a player
player_name = input("Enter the player's name: ")

# Get player stats and combined score
player_stats = search_player_stats(player_name, attacking_clustered, attacking_metrics)

# Display the result
if isinstance(player_stats, dict):
    print(f"\nStats for {player_name}:")
    # Using list comprehension for concise output
    print("\n".join([f"{stat}: {value}" for stat, value in player_stats.items()]))
else:
    print(player_stats)
Enter the player's name: Foden

Stats for Foden:
Goals (percentile): 95.0
Goals/90 (percentile): 73.0
Minutes/Goal (percentile): 73.0
xG (percentile): 98.0
xG/90 (percentile): 79.0
xG/Shot (percentile): 60.0
Non-pen Goals (percentile): 96.0
Non-pen Goals/90 (percentile): 71.0
Non-pen Goals/Shot (percentile): 51.0
Non-pen xG (percentile): 97.0
Non-pen xG/90 (percentile): 78.0
Shots (percentile): 98.0
Shots/90 (percentile): 81.0
Shots on Target (percentile): 98.0
Shots on Target/90 (percentile): 84.0
Shots on Target % (percentile): 82.0
Shots Outside Box/90 (percentile): 72.0
Goal Contributions (percentile): 99.0
Goal Contributions/90 (percentile): 89.0
Goals Outside Box (percentile): 0.0
Goals Outside Box/90 (percentile): 0.0
Conversion % (percentile): 55.0
Penalties Taken (percentile): 60.0
Penalties Scored (percentile): 0.0
Pens Scored % (percentile): 1.0
Combined Score: 67.6

Creative Metric Analysis¶

image.png

In [36]:
# Define creative metrics
creative_metrics = [
    'Assists (percentile)', 'Assists/90 (percentile)', 'xA (percentile)', 'xA/90 (percentile)', 'xA Overperformance (percentile)',
    'Key Passes (percentile)', 'Key Passes/90 (percentile)', 'Key Pass % (percentile)', 'Open Play Key Passes (percentile)',
    'Chances Created (percentile)', 'Chances Created/90 (percentile)', 'Clear Cut Chances Created (percentile)',
    'Clear Cut Chances Created/90 (percentile)', 'Progressive Passes/90 (percentile)', 'Progressive Passes (percentile)',
    'Pass Completion % (percentile)', 'Crosses Attempted (percentile)', 'Crosses Attempted/90 (percentile)',
    'Crosses Completed (percentile)', 'Crosses Completed/90 (percentile)', 'Passes Attempted (percentile)', 
    'Passes Attempted/90 (percentile)'
]

# Call the dynamic clustering function for creative metrics
creative_clustered, creative_centroids = dynamic_clustering(
    data=data_europe,  # Updated variable name
    features=creative_metrics,
    n_clusters=4
)

# Calculate a combined score from the creative metrics (average in this case)
creative_clustered['Combined Score'] = creative_clustered[creative_metrics].mean(axis=1)

# Sort players by combined score to get the top performers
top_5_creative = creative_clustered.nlargest(5, 'Combined Score')

# Display the results
print("Top 5 Creative Players based on combined score:")
display(top_5_creative)

# Interactive Plotly Visualization for Creative Metrics
fig2 = px.scatter(
    creative_clustered,
    x='PCA1',
    y='PCA2',
    color='Cluster',
    symbol='Division',  # Use the correct column name for league
    size='Minutes Played',
    hover_data=['Name', 'Age', 'Cluster', 'Position', 'Minutes Played', 'Division'],
    title='Interactive Creative Player Clustering'
)
fig2.update_layout(
    title_x=0.5,
    width=900,
    height=600,
    legend=dict(
        x=1.05,
        y=1,
        xanchor="left",
        yanchor="top",
        title=dict(text="Cluster")
    ),
    paper_bgcolor='rgb(243, 243, 243)',  # Light background for clean look
    plot_bgcolor='rgba(0,0,0,0)'
)

# Show the Plot
#fig2.show()

# Display Cluster Centroids for Creative Metrics
print("Creative Cluster Centroids:")
display(creative_centroids)
/Users/JumpMan/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning:

The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning

Top 5 Creative Players based on combined score:
Name Position Age Height Weight Inf Club Division Nationality Home-Grown ... Attacking Forward Finisher Aerial Threat Reader Assister Ball Winning Defenders Cluster PCA1 PCA2 Combined Score
94 Hakan Çalhanoğlu DM, M/AM (C) 30 5'10" 152 lbs - Inter Italian Serie A TUR (GER) - ... 79 47 27 54 99 65 1 10.514077 -1.236828 93.318182
66 Luka Modrić DM, M/AM (C) 38 5'8" 145 lbs - R. Madrid Spanish First Division CRO - ... 94 58 19 37 99 53 1 10.563184 -2.395553 93.000000
34 Martin Ødegaard M/AM (C) 25 5'10" 149 lbs - Arsenal English Premier Division NOR - ... 99 90 19 47 98 54 1 10.566436 -1.976729 90.545455
585 Pascal Groß D (R), DM, M/AM (C) 32 5'11" 167 lbs - Borussia Dortmund Bundesliga GER - ... 90 56 60 73 100 82 1 10.550302 -0.704693 90.500000
23 Joshua Kimmich D/WB (R), DM, M (C) 29 5'10" 165 lbs - Paris SG Ligue 1 Uber Eats GER - ... 76 33 57 58 99 71 1 9.941278 -1.914790 89.363636

5 rows × 215 columns

Creative Cluster Centroids:
PCA1 PCA2
0 2.188440 -0.251752
1 6.578570 -0.717672
2 0.887482 2.745893
3 -2.960795 -0.377333

Creative Score Search

In [37]:
def search_player_stats(player_name, creative_clustered, creative_metrics):
    """
    Search for a player by name and return their creative stats and combined creative score.

    Args:
    player_name (str): Name of the player to search for.
    creative_clustered (DataFrame): The DataFrame with player clustering data.
    creative_metrics (list): The list of creative metrics.

    Returns:
    dict or str: Player's stats and combined score, or message if player is not found.
    """
    # Filter the DataFrame for the player by name (case-insensitive)
    player_data = creative_clustered[creative_clustered['Name'].str.contains(player_name, case=False, na=False)]
    
    # If player is found, return their stats and combined score
    if not player_data.empty:
        player_stats = player_data[creative_metrics + ['Combined Score']].iloc[0].to_dict()
        return player_stats
    return f"Player '{player_name}' not found."

# User input to search for a player
player_name = input("Enter the player's name: ")

# Get player stats and combined score
player_stats = search_player_stats(player_name, creative_clustered, creative_metrics)

# Display the result
if isinstance(player_stats, dict):
    print(f"\nStats for {player_name}:")
    # Using list comprehension for concise output
    print("\n".join([f"{stat}: {value}" for stat, value in player_stats.items()]))
else:
    print(player_stats)
Enter the player's name: Foden

Stats for Foden:
Assists (percentile): 99.0
Assists/90 (percentile): 89.0
xA (percentile): 99.0
xA/90 (percentile): 89.0
xA Overperformance (percentile): 99.0
Key Passes (percentile): 98.0
Key Passes/90 (percentile): 83.0
Key Pass % (percentile): 80.0
Open Play Key Passes (percentile): 99.0
Chances Created (percentile): 99.0
Chances Created/90 (percentile): 93.0
Clear Cut Chances Created (percentile): 100.0
Clear Cut Chances Created/90 (percentile): 94.0
Progressive Passes/90 (percentile): 44.0
Progressive Passes (percentile): 87.0
Pass Completion % (percentile): 34.0
Crosses Attempted (percentile): 93.0
Crosses Attempted/90 (percentile): 68.0
Crosses Completed (percentile): 81.0
Crosses Completed/90 (percentile): 51.0
Passes Attempted (percentile): 89.0
Passes Attempted/90 (percentile): 44.0
Combined Score: 82.36363636363636

Defensive Metric Analysis¶

image.png

In [20]:
# Define defensive metrics
defensive_metrics = [
    'Tackles Attempted (percentile)', 'Tackles Attempted/90 (percentile)', 'Tackles Completed (percentile)',
    'Tackles Completed/90 (percentile)', 'Interceptions (percentile)', 'Interceptions/90 (percentile)', 'Clearances (percentile)',
    'Clearances/90 (percentile)', 'Blocks (percentile)', 'Blocks/90 (percentile)', 'Key Tackles (percentile)', 
    'Key Tackles/90 (percentile)', 'Tackle Completion % (percentile)', 'Defensive Actions/90 (percentile)',
    'Fouls Committed/90 (percentile)', 'Possession Won/90 (percentile)', 'Duels Win % (percentile)', 'Headers Won % (percentile)'
]

# Call the dynamic clustering function for defensive metrics
defensive_clustered, defensive_centroids = dynamic_clustering(
    data=data_europe,  # Updated variable name
    features=defensive_metrics,
    n_clusters=4
)

# Calculate a combined score from the defensive metrics (average in this case)
defensive_clustered['Combined Score'] = defensive_clustered[defensive_metrics].mean(axis=1)

# Sort players by combined score to get the top performers
top_5_defensive = defensive_clustered.nlargest(5, 'Combined Score')

# Display the results
print("Top 5 Defensive Players based on combined score:")
display(top_5_defensive)

# Interactive Plotly Visualization for Defensive Metrics
fig3 = px.scatter(
    defensive_clustered,
    x='PCA1',
    y='PCA2',
    color='Cluster',
    symbol='Division',  # Use the correct column name for league
    size='Minutes Played',
    hover_data=['Name', 'Age', 'Cluster', 'Position', 'Minutes Played', 'Division'],
    title='Interactive Defensive Player Clustering'
)
fig3.update_layout(
    title_x=0.5,
    width=900,
    height=600,
    legend=dict(
        x=1.05,
        y=1,
        xanchor="left",
        yanchor="top",
        title=dict(text="Cluster")
    ),
    paper_bgcolor='rgb(243, 243, 243)',  # Light background for clean look
    plot_bgcolor='rgba(0,0,0,0)'
)

# Show the Plot
#fig3.show()

# Display Cluster Centroids for Defensive Metrics
print("Defensive Cluster Centroids:")
display(defensive_centroids)
/Users/JumpMan/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning:

The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning

Top 5 Defensive Players based on combined score:
Name Position Age Height Weight Inf Club Division Nationality Home-Grown ... Attacking Forward Finisher Aerial Threat Reader Assister Ball Winning Defenders Cluster PCA1 PCA2 Combined Score
1892 Django Warmerdam D/WB (L), DM, M (LC) 28 5'11" 160 lbs - Excelsior Eredivisie NED Trained in nation (15-21) ... 48 54 95 99 44 88 0 8.397846 0.601474 86.055556
2903 Bas Kuipers D/WB (L) 29 5'11" 163 lbs - FC Twente Eredivisie NED Trained in nation (15-21) ... 73 53 85 99 97 87 0 8.121900 1.558674 85.166667
1774 Jesús Vázquez D/WB/M/AM (L) 21 6'0" 174 lbs - Valencia Spanish First Division ESP - ... 46 5 68 93 70 79 0 8.396620 0.663128 85.000000
1231 Maximilian Mittelstädt D/WB (L) 27 5'11" 156 lbs - VfB Stuttgart Bundesliga GER - ... 36 5 61 96 80 86 0 8.666008 1.584613 84.444444
2928 Boy Kemper D (LC), WB (L), DM 24 6'1" 180 lbs Sct NAC Breda Eredivisie NED Trained in nation (15-21) ... 25 34 87 99 48 89 0 8.050326 1.287554 83.444444

5 rows × 215 columns

Defensive Cluster Centroids:
PCA1 PCA2
0 5.895340 1.671089
1 -2.955537 0.469125
2 3.271433 -1.161332
3 0.041772 -1.269763

Defensive Score Metric

In [28]:
def search_player_stats(player_name, defensive_clustered, defensive_metrics):
    """
    Search for a player by name and return their defensive stats and combined defensive score.

    Args:
    player_name (str): Name of the player to search for.
    defensive_clustered (DataFrame): The DataFrame with player clustering data.
    defensive_metrics (list): The list of defensive metrics.

    Returns:
    dict or str: Player's stats and combined score, or message if player is not found.
    """
    # Filter the DataFrame for the player by name (case-insensitive)
    player_data = defensive_clustered[defensive_clustered['Name'].str.contains(player_name, case=False, na=False)]
    
    # If player is found, return their stats and combined score
    if not player_data.empty:
        player_stats = player_data[defensive_metrics + ['Combined Score']].iloc[0].to_dict()
        return player_stats
    return f"Player '{player_name}' not found."

# User input to search for a player
player_name = input("Enter the player's name: ")

# Get player stats and combined score
player_stats = search_player_stats(player_name, defensive_clustered, defensive_metrics)

# Display the result
if isinstance(player_stats, dict):
    print(f"\nStats for {player_name}:")
    # Using list comprehension for concise output
    print("\n".join([f"{stat}: {value}" for stat, value in player_stats.items()]))
else:
    print(player_stats)
Enter the player's name: Phil Foden

Stats for Phil Foden:
Tackles Attempted (percentile): 98.0
Tackles Attempted/90 (percentile): 77.0
Tackles Completed (percentile): 97.0
Tackles Completed/90 (percentile): 69.0
Interceptions (percentile): 89.0
Interceptions/90 (percentile): 42.0
Clearances (percentile): 75.0
Clearances/90 (percentile): 26.0
Blocks (percentile): 82.0
Blocks/90 (percentile): 40.0
Key Tackles (percentile): 0.0
Key Tackles/90 (percentile): 0.0
Tackle Completion % (percentile): 22.0
Defensive Actions/90 (percentile): 39.0
Fouls Committed/90 (percentile): 51.0
Possession Won/90 (percentile): 42.0
Duels Win % (percentile): 15.0
Headers Won % (percentile): 13.0
Combined Score: 48.72222222222222

Cluster Plane search

In [17]:
#pip install fuzzywuzzy python-Levenshtein
In [25]:
import pandas as pd
import numpy as np
from fuzzywuzzy import process
from sklearn.metrics import pairwise_distances
from IPython.display import display

def find_similar_players(player_name, cluster_data, n=5):
    """
    Finds the top `n` most similar players to the given player based on their position
    in the PCA1-PCA2 cluster plane.
    
    Parameters:
    - player_name (str): The name of the player to search for.
    - cluster_data (DataFrame): The DataFrame containing the cluster assignments, PCA1, PCA2, and player names.
    - n (int): The number of similar players to return (default is 5).
    
    Returns:
    - DataFrame: A DataFrame with the top `n` most similar players based on their position in the PCA plane.
    """
    # Ensure the player exists in the data
    correct_name = fuzzy_search(player_name, cluster_data)
    if not correct_name:
        raise ValueError("No close match found for the player name.")

    # Get the player's PCA coordinates
    player_row = cluster_data[cluster_data['Name'] == correct_name]
    player_pca1 = player_row['PCA1'].values[0]
    player_pca2 = player_row['PCA2'].values[0]
    
    # Calculate the Euclidean distance between the player's PCA coordinates and all other players
    other_players = cluster_data[cluster_data['Name'] != correct_name]
    distances = np.sqrt((other_players['PCA1'] - player_pca1)**2 + (other_players['PCA2'] - player_pca2)**2)
    
    # Add distances to the dataframe
    other_players['Distance'] = distances
    
    # Sort by distance and return the top n most similar players
    similar_players = other_players.nsmallest(n, 'Distance')[['Name', 'Distance', 'PCA1', 'PCA2']]
    
    return similar_players, correct_name

def fuzzy_search(query, cluster_data):
    """
    Performs a fuzzy search on the player names and returns the best match.
    
    Parameters:
    - query (str): The name of the player to search for.
    - cluster_data (DataFrame): The DataFrame containing player names.
    
    Returns:
    - str: The closest matching player name, or None if no close match is found.
    """
    # Get the list of player names
    player_names = cluster_data['Name'].tolist()
    
    # Use fuzzywuzzy to find the closest match
    match, score = process.extractOne(query, player_names)
    
    # Only return the match if the score is high enough (e.g., score > 80)
    if score >= 80:
        return match
    else:
        return None

# Example usage:
# Assuming `attacking_clustered` is your DataFrame containing player names and PCA1, PCA2 coordinates.

# Run a dynamic search while the code is running
player_name = input("Enter the player's name: ")  # User input

try:
    similar_players, correct_name = find_similar_players(player_name, attacking_clustered)
    display(f"Did you mean: {correct_name}?")  # Display the player name suggestion
    display(similar_players)  # Display the DataFrame with the top similar players
except ValueError as e:
    display(e)  # Display the error if no match is found
Enter the player's name: Phil Foden
/var/folders/bf/4lyx7mbx6_x3y4pb5d93fm7h0000gp/T/ipykernel_62540/2248603950.py:35: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

'Did you mean: Phil Foden?'
Name Distance PCA1 PCA2
1061 Robert Andrich 0.030948 3.464052 -1.429027
2466 Yannik Engelhardt 0.045184 3.525816 -1.410203
2569 Pelle Clement 0.071949 3.439708 -1.489610
796 Harry Winks 0.075548 3.550945 -1.488922
1941 Isaac Hayden 0.094428 3.586660 -1.449655

Statisitcal Player search

In [19]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.impute import SimpleImputer

def find_similar_players_statistics(player_name, data, n=6):
    """
    Finds the top `n` most similar players based on their statistics.
    
    Parameters:
    - player_name (str): The name of the player to search for.
    - data (DataFrame): The dataset containing player statistics.
    - n (int): The number of similar players to return (default is 5).
    
    Returns:
    - DataFrame: A DataFrame with the top `n` most similar players based on their statistics.
    """
    # Exclude columns that are not relevant for the similarity (e.g., 'Games', 'Minutes Played')
    excluded_columns = ['Games', 'Minutes Played']
    data_relevant = data.drop(columns=excluded_columns, errors='ignore')
    
    # Retrieve the player’s stats by name
    player_stats = data_relevant[data_relevant['Name'] == player_name].iloc[0]
    
    # Ensure all columns in data are present in player_stats (fill missing columns)
    missing_columns = set(data_relevant.columns) - set(player_stats.index)
    for col in missing_columns:
        player_stats[col] = np.nan  # Fill missing columns with NaN
    
    # Convert player statistics to DataFrame for compatibility
    player_stats_df = pd.DataFrame([player_stats])
    
    # Impute missing values (e.g., by filling with the mean of the column)
    imputer = SimpleImputer(strategy='mean')  # or 'median'
    data_imputed = pd.DataFrame(imputer.fit_transform(data_relevant.select_dtypes(include=[np.number])))
    player_stats_imputed = pd.DataFrame(imputer.transform(player_stats_df[data_relevant.select_dtypes(include=[np.number]).columns]))
    
    # Standardize the data (only numeric columns)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data_imputed)
    
    # Scale the input player's statistics
    player_stats_scaled = scaler.transform(player_stats_imputed)
    
    # Calculate Euclidean distances between the input player and all other players
    distances = euclidean_distances(player_stats_scaled, data_scaled)
    
    # Add distances to the dataset
    data_relevant['Distance'] = distances[0]
    
    # Sort by distance and return the top n most similar players
    similar_players = data_relevant.nsmallest(n, 'Distance')[['Name', 'Distance'] + list(data_relevant.select_dtypes(include=[np.number]).columns)]
    
    return similar_players

# Example usage: Input player name for comparison
player_name = input("Enter the name of the player you want to search for: ")

# Assuming `data_europe` is your DataFrame containing player statistics
similar_players = find_similar_players_statistics(player_name, data_europe)

print(f"Top 5 Most Similar Players to {player_name}:")
display(similar_players)
Enter the name of the player you want to search for: Erling Haaland
Top 5 Most Similar Players to Erling Haaland:
Name Distance Age Starts Average Rating Sub Appearances Minutes/Game Goals (percentile) Goals/90 (percentile) Minutes/Goal (percentile) ... Finisher Aerial Threat Reader Assister Ball Winning Defenders Cluster PCA1 PCA2 Combined Score Distance
4 Erling Haaland 0.000000 23 43 7.44 0 69.21 100 97 96 ... 99 76 37 91 52 3 1.429782 -1.227910 30.611111 0.000000
6 Harry Kane 7.838198 30 38 7.42 0 80.71 100 97 96 ... 99 77 36 86 53 2 1.333988 -0.957876 31.611111 7.838198
0 Kylian Mbappé 8.319176 25 49 7.60 0 82.69 100 98 97 ... 99 46 29 97 37 2 0.994908 -0.788362 28.777778 8.319176
7 Robert Lewandowski 8.649842 35 49 7.24 0 88.00 100 91 91 ... 98 71 28 88 25 2 1.102815 0.134810 30.222222 8.649842
153 Victor Boniface 8.682437 23 35 7.15 2 69.76 99 94 93 ... 97 74 29 50 28 3 0.754453 -1.242610 28.222222 8.682437
55 Romelu Lukaku 9.040984 31 34 7.16 0 82.50 99 94 93 ... 98 70 32 96 13 2 1.598513 -0.421074 31.555556 9.040984

6 rows × 202 columns

Cluster Search vs Statistical Search Analysis¶

Differences Between Cluster-Based Search and Statistical Search¶

1. Cluster-Based Search:¶

  • Method:

    • Uses PCA (Principal Component Analysis), a dimensionality reduction technique that transforms the original high-dimensional data into a smaller number of components (in this case, PCA1 and PCA2).
  • Basis of Similarity:

    • Similarity is based on the relative positioning of players in the reduced PCA space. Players who are closer in this reduced space (PCA1 and PCA2) are considered more similar.
  • Results:

    • When searching for a player (e.g., Erling Haaland), the system finds players whose PCA1 and PCA2 values are closest to Haaland’s.
    • The similarity here captures broad patterns of player types or styles, rather than specific performance metrics.
    • Global similarity measure based on how players are grouped in the cluster, considering overall playing style, position, or role.

2. Statistical Search:¶

  • Method:

    • This search is based on a player's specific statistics (e.g., goals, assists, minutes played). These statistics are used to compare players across various performance metrics.
    • Similarity is calculated using Euclidean distance (or other distance metrics) between the player's statistical profile and others.
  • Basis of Similarity:

    • The statistical search focuses on individual performance metrics, such as "Goals (percentile)", "xG (percentile)", etc., directly reflecting how a player performs in specific areas.
    • The distance is calculated in a high-dimensional space, where each dimension represents a different statistic.
  • Results:

    • Players are ranked based on how similar their statistical profiles are to the input player (e.g., Erling Haaland).
    • The search prioritizes granular comparisons of actual performance, such as goals per 90 minutes or assists, rather than overall positioning in a reduced space.
    • Players may not share similar styles but can have very similar statistical profiles.

Key Differences in the Results:¶

  • Cluster Search:

    • Focuses on overall style and positioning of players based on PCA components (PCA1 and PCA2).
    • Groups players who have similar playing styles or roles, even if their individual statistics may differ.
    • This similarity measure is more abstract and global.
  • Statistical Search:

    • Focuses on the individual performance metrics, such as goals, assists, and defensive actions.
    • Evaluates players based on how similar their statistical profiles are, offering a more granular comparison.

Why Do the Results Differ?¶

  1. Dimensionality vs. Specificity:

    • PCA reduces the complexity of the data by distilling it into two main components (PCA1 and PCA2). It groups players based on similar styles rather than specific statistics.
    • Statistical search compares the actual performance metrics across multiple individual statistics, providing a more detailed and performance-specific comparison.
  2. Clustered Data Grouping:

    • Players in the PCA-based cluster might cluster together due to similar overall traits (e.g., playing position, role, or team system) even if their individual statistics differ.
    • The statistical search might return players who have similar performance metrics but differ in playing style or position.
  3. Euclidean vs. PCA Distances:
    • The statistical search uses Euclidean distance to compare raw player data in a high-dimensional space. This can capture performance-based similarities, considering many individual metrics simultaneously.
    • The PCA-based search measures distances in a reduced space, focusing on global patterns or styles rather than detailed statistics. It uses the PCA components to capture broader trends.

In Summary:¶

  • Cluster-based similarity measures general positional or style-based similarity, typically grouping players who play similarly or share broad characteristics, regardless of their individual performance stats.
  • Statistical similarity measures how close players are in terms of their actual performance metrics, focusing on detailed attributes like goals, assists, minutes played, and other performance indicators.

These methods answer different kinds of questions:

  • Cluster Search: "Who plays in a similar way to this player?"
  • Statistical Search: "Who performs similarly to this player in key statistical metrics?"

As a result, the results from these two searches will differ because one is based on style or role similarity (cluster search), while the other focuses on actual performance (statistical search).